========================================================

## [1] 1599   13
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18

The raw dataset consists of 13 variables and 1599 observations. None of the columns contains null values. We converted quality from an integer to a factor and removed the X column, which leaves 12 variables.
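
The cleaning steps described above can be sketched as follows. This is a minimal illustration on a small hand-made data frame; the names `raw`, `wine`, and `na_counts` are hypothetical, and the real analysis would read the full CSV instead.

```r
# Hypothetical stand-in for the raw file: an index column X plus a few wine columns.
raw <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.0),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.39),
  quality = c(5L, 5L, 5L, 6L, 5L, 7L)
)

# Count missing values per column.
na_counts <- colSums(is.na(raw))

# Drop the index column and store quality as a factor.
wine <- raw[, names(raw) != "X"]
wine$quality <- factor(wine$quality)

dim(raw)   # one more column than dim(wine)
str(wine)
```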

Univariate Plots Section

From the above, we can see that quality in our red wine data spans from 3 to 8. Most of the wines have a quality of 5 or 6, and very few have a quality of 3 or 8. We also used a more detailed y-axis scale to see how imbalanced the quality data is. From there we can observe that only about 50 wines have a quality score of 4 (the summary above gives 53).

From the above we can see that most of the red wines have a pH between 3.2 and 3.4. The pH distribution also looks approximately normal. There seem to be outliers as well: values above 4 and below 2.75.

From the alcohol data we can see that the most common alcohol percentage is around 9 to 10%. We can also observe that most of the wines have a density between 0.995 and 1, and that the density distribution looks approximately normal.

Most of the wines have a citric acid value near 0, although a considerable number sit around 0.5. The distribution looks right skewed, so we may need a log transformation later.
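
Because citric acid contains exact zeros (its minimum in the summary is 0.000), a plain logarithm would produce `-Inf` for those rows. One common workaround, shown here only as a sketch on made-up values, is `log1p()`, which computes log(1 + x). Whether this is the transformation the analysis should use is an assumption.

```r
# Hypothetical citric acid values, including exact zeros as in the summary above.
citric <- c(0, 0, 0.04, 0.56, 0.06, 0.02, 0.36, 1.00)

log10(citric)[1]              # -Inf: a plain log cannot represent the zeros

citric_log <- log1p(citric)   # log(1 + x), which is defined at zero
citric_log
```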

The most common fixed acidity values lie between 6 and 8, while the most common volatile acidity values lie between 0.4 and 0.8.

For residual sugar, the most common values lie between 1 and 3 g/dm^3. For sulphates, the most common values lie between 0.5 and 1 g/dm^3.

We can also see several wines with residual sugar above 8 and total sulfur dioxide above 200. These could be outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

For free sulfur dioxide, the most common values lie between 0 and 15 mg/dm^3. For total sulfur dioxide, the most common values lie between 0 and 50 mg/dm^3. We also create a new variable, the free sulfur dioxide percentage, which is the free sulfur dioxide divided by the total sulfur dioxide. All of its values are between 0 and 1.
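
A minimal sketch of the derived variable, using the first four rows shown in the structure output above. The column name `free.so2.pct` is hypothetical; note that the ratio is stored as a fraction between 0 and 1 rather than multiplied by 100.

```r
# First four rows of the two sulfur dioxide columns from the str() output.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60)
)

# Share of total sulfur dioxide that is free (hypothetical column name).
wine$free.so2.pct <- wine$free.sulfur.dioxide / wine$total.sulfur.dioxide

range(wine$free.so2.pct)
```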

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most wines have chlorides of a little under 0.1 g/dm^3 (the median is 0.079).

Univariate Analysis

What is the structure of your dataset?

Our dataset consists of 13 variables and 1599 observations. All of the variables are numeric.

- Most of the red wines have a quality of 5 or 6.
- Most of the variables, such as chlorides, sulphates, and volatile acidity, seem to be roughly normally distributed.
- The citric acid variable seems to be skewed to the right. We may need a log transformation for it later.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are alcohol and quality. I’d like to determine which features are best for predicting the quality of a red wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model to determine the quality of red wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Some other variables that indicate the taste of the wine, such as acidity, sugar, and density, are likely to affect the quality of the red wine. After researching information on wine quality, I think density and alcohol contribute most to the quality.

Did you create any new variables from existing variables in the dataset?

I created the free sulfur dioxide percentage, which measures the share of free sulfur dioxide in the total sulfur dioxide of each wine.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I checked each column for null values and found none. I also changed quality to be a factor instead of a number.

Bivariate Plots Section

From the plot above, we can see that alcohol, sulphates, and volatile acidity seem to have a weak correlation with quality. The plot also confirms our previous finding that some of the variables have right-skewed distributions.

However, it is interesting to see strong correlations between other pairs of variables, such as fixed acidity vs citric acid, volatile acidity vs citric acid, fixed acidity vs density, and fixed acidity vs pH. Even though we are mainly interested in which factors correlate with quality, it is also important to note which pairs of variables are correlated with each other, especially if we later want to fit a linear regression.

I want to look more closely at the scatter plots of quality against other variables such as alcohol, volatile acidity, and sulphates.

Since a scatter plot is typically not a good fit for discrete data, we use jitter to avoid overplotting, where points are drawn on top of each other.
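
To illustrate what jitter does, here is a small sketch using base R's `jitter()` on the first ten rows of the data. The original plots may well have used a plotting-layer equivalent (for example `geom_jitter()` in ggplot2), so treat the function choice and the `amount` value as assumptions.

```r
set.seed(1)  # make the random jitter reproducible

# First ten quality scores and alcohol values from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Add uniform noise of at most +/- 0.2 to the discrete axis only.
quality_jittered <- jitter(quality, amount = 0.2)

# plot(quality_jittered, alcohol) would now draw overlapping points separately.
```

Only the discrete variable is jittered, so the alcohol values on the other axis stay exact.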

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12503    0.17471  -0.716    0.474    
## alcohol      0.36084    0.01668  21.639   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16
## wineData$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wineData$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wineData$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wineData$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wineData$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wineData$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

After adding jitter to the plot, we can see that in most cases the alcohol content of a red wine seems to correlate with its quality. The adjusted R-squared shows that alcohol explains about 22.63 percent of the variance in quality.
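
One caveat about the model call above: `as.numeric()` applied to a factor returns the internal level codes, not the original scores. With all six levels 3 to 8 present, the codes are simply the scores minus 2, so the slope and R-squared are unaffected and only the intercept is shifted by 2. The sketch below shows the difference; the variable names are illustrative.

```r
quality <- factor(c(3, 4, 5, 6, 7, 8))

codes  <- as.numeric(quality)                 # level codes: 1 2 3 4 5 6
scores <- as.numeric(as.character(quality))   # original scores: 3 4 5 6 7 8

scores - codes                                # constant offset of 2
```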

Let’s try the same thing for volatile acidity.

## wineData$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wineData$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wineData$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wineData$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wineData$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wineData$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

After setting the transparency, there seems to be a weak negative linear correlation between volatile acidity and quality. This becomes clearer in the boxplot. Nearly 50% of the lowest-quality wines seem to have higher volatile acidity than wines of greater quality. In fact, volatile acidity explains about 15% of the variance in quality. As quality increases, the mean and median volatile acidity of the wines at that quality generally decrease, although the summary above shows the mean rising slightly from quality 7 (0.4039) to quality 8 (0.4233) while the median stays at 0.37. Because the value ranges overlap heavily across quality levels, volatile acidity may only be useful for telling wines of quality 3 apart from wines of quality 7 to 8.
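
Per-quality summaries like the ones above can be computed with `tapply()` (or `by()`). The sketch below uses a tiny made-up data frame, so the numbers are illustrative only.

```r
# Hypothetical miniature of the wine data.
wine <- data.frame(
  quality          = factor(c(3, 3, 5, 5, 5, 7, 7)),
  volatile.acidity = c(0.88, 0.76, 0.70, 0.66, 0.58, 0.28, 0.36)
)

# Median volatile acidity within each quality level.
medians <- tapply(wine$volatile.acidity, wine$quality, median)
medians

# Negative differences mean the median falls as quality rises.
diff(as.numeric(medians))
```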

## 
## Call:
## lm(formula = as.numeric(quality) ~ sulphates, data = wineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2432 -0.5424  0.1102  0.4456  2.3977 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.84775    0.07842   36.31   <2e-16 ***
## sulphates    1.19771    0.11539   10.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7819 on 1597 degrees of freedom
## Multiple R-squared:  0.0632, Adjusted R-squared:  0.06261 
## F-statistic: 107.7 on 1 and 1597 DF,  p-value: < 2.2e-16

We can also see a correlation between sulphates and quality: the median sulphates content increases with quality. However, there is a considerable number of outliers at quality 5 and 6. This may be because we have far more wines of quality 5 and 6, or it could indicate that sulphates is not a strong indicator of quality. In fact, sulphates explains only about 6% of the variance in quality.

From the boxplot and scatterplot above, free sulfur dioxide does not seem to have a linear correlation with quality.

Let’s look at the correlation between fixed acidity and density.

## 
## Call:
## lm(formula = density ~ fixed.acidity, data = wineData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0064452 -0.0007700  0.0000738  0.0009434  0.0055816 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.907e-01  1.716e-04 5774.70   <2e-16 ***
## fixed.acidity 7.242e-04  2.018e-05   35.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001405 on 1597 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.4459 
## F-statistic:  1287 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the graph, there seems to be a moderate positive linear relation between fixed acidity and density. In fact, fixed acidity explains about 45% of the variance in density.
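
For a regression with a single predictor, R-squared equals the squared Pearson correlation, so the strength of the relationship can be read off by taking a square root. The sketch below checks the identity on simulated data (not the wine data) and then applies it to the R-squared reported above.

```r
set.seed(42)

# Simulated data, used only to demonstrate the identity.
x <- rnorm(200)
y <- 0.7 * x + rnorm(200)

fit       <- lm(y ~ x)
r_squared <- summary(fit)$r.squared
r         <- cor(x, y)

all.equal(r^2, r_squared)   # the two quantities agree for a one-predictor model

# Applied to the density model above: |r| is about 0.67.
sqrt(0.4463)
```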

We know that at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine, while at low concentrations it is undetectable. Let’s find out how many samples have a high concentration of SO2 and summarize them for each quality value.

##    Mode   FALSE    TRUE 
## logical    1583      16
## wineData$quality: 3
##    Mode   FALSE 
## logical      10 
## -------------------------------------------------------- 
## wineData$quality: 4
##    Mode   FALSE 
## logical      53 
## -------------------------------------------------------- 
## wineData$quality: 5
##    Mode   FALSE    TRUE 
## logical     672       9 
## -------------------------------------------------------- 
## wineData$quality: 6
##    Mode   FALSE    TRUE 
## logical     633       5 
## -------------------------------------------------------- 
## wineData$quality: 7
##    Mode   FALSE    TRUE 
## logical     197       2 
## -------------------------------------------------------- 
## wineData$quality: 8
##    Mode   FALSE 
## logical      18

In our sample, only red wines of quality 5 to 7 include samples with a high concentration of SO2. However, the number of cases is small (16 of 1599, about 1%), so it would not be wise to conclude that red wines with a high concentration of SO2 are generally of quality 5 to 7.
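
The counts above come from flagging wines whose free sulfur dioxide exceeds 50 and tabulating the flag by quality. A minimal sketch on made-up rows follows; it assumes the threshold can be applied directly to the `free.sulfur.dioxide` column, and the name `high_so2` is hypothetical.

```r
# Hypothetical rows; two of them exceed the threshold of 50.
wine <- data.frame(
  quality             = factor(c(5, 5, 6, 6, 7, 8)),
  free.sulfur.dioxide = c(11, 52, 68, 17, 9, 15)
)

# Logical flag for high free SO2.
high_so2 <- wine$free.sulfur.dioxide > 50

# Cross-tabulate the flag against quality.
counts <- table(quality = wine$quality, high_so2 = high_so2)
counts
```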

Let’s look at the correlation between fixed acidity and pH.

## 
## Call:
## lm(formula = pH ~ fixed.acidity, data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51780 -0.06547  0.00164  0.06488  0.52207 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.814959   0.013776  276.93   <2e-16 ***
## fixed.acidity -0.060561   0.001621  -37.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1128 on 1597 degrees of freedom
## Multiple R-squared:  0.4665, Adjusted R-squared:  0.4661 
## F-statistic:  1396 on 1 and 1597 DF,  p-value: < 2.2e-16

There is a moderate negative correlation between pH and fixed acidity (R-squared of about 0.47). This makes sense: the more acid a wine contains, the lower its pH.

Let’s look at the correlation between volatile acidity and alcohol to determine whether we can use both variables to predict the quality of the wine.

## 
## Call:
## lm(formula = volatile.acidity ~ alcohol, data = wineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3692 -0.1292 -0.0084  0.1007  1.0684 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.882094   0.043142  20.446  < 2e-16 ***
## alcohol     -0.033990   0.004118  -8.255 3.16e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1754 on 1597 degrees of freedom
## Multiple R-squared:  0.04092,    Adjusted R-squared:  0.04032 
## F-statistic: 68.14 on 1 and 1597 DF,  p-value: 3.155e-16

As we can see from above, the adjusted R-squared between volatile acidity and alcohol is small (about 0.04). This indicates that the two variables are only weakly collinear, so we can use both of them together to predict the quality of the wine.
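
One way to quantify this is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other. The report itself does not compute a VIF, so this is an added illustration using the multiple R-squared printed above; values close to 1 indicate negligible collinearity.

```r
# Multiple R-squared of volatile.acidity ~ alcohol, taken from the output above.
r2_between_predictors <- 0.04092

# Variance inflation factor for a model containing both predictors.
vif <- 1 / (1 - r2_between_predictors)
vif   # roughly 1.04
```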

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Quality correlates weakly with volatile acidity and alcohol.

When the alcohol percentage increases, the quality tends to increase. However, as quality increases, the variance in alcohol percentage also tends to increase.

Based on the R^2 value, alcohol explains about 23% of the variance in quality. Other features of interest could be incorporated into the model to explain more of the variance in quality.

Red wines with a high volatile acidity concentration tend to have lower quality. About three quarters of the wines with quality 3 have a volatile acidity of roughly 0.65 g/dm^3 or more, and most of the wines with quality greater than 6 have a volatile acidity lower than 0.5 g/dm^3. I suppose this is because high acidity leads to an unpleasant taste.

Moreover, wines with a higher concentration of sulphates tend to have higher quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We can see stronger correlation between fixed acidity vs density and between fixed acidity and pH.

What was the strongest relationship you found?

The strongest relationship is between fixed acidity and pH. For the main feature, quality is most strongly correlated with alcohol and volatile acidity. Moreover, since alcohol and volatile acidity are only weakly correlated with each other, we can use both variables later to predict the quality of the wine.

Multivariate Plots Section

In the plot above we graph the median volatile acidity against alcohol for each quality value. As quality decreases toward the lowest level, the volatile acidity tends to increase for some values of alcohol. However, this does not hold for all alcohol percentages, especially above 12%. Each line also looks erratic, which suggests that there is no linear relation between alcohol and volatile acidity.

In the plot above, even though fixed acidity and pH are correlated, there is no clear separation of point colors. This shows that the two variables combined do not correlate with quality.

This plot looks a bit better than the previous one, as we can see some grouping in the location of each color.

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = wineData)
## m2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity, 
##     data = wineData)
## m3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates, data = wineData)
## m4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates + citric.acid, data = wineData)
## m5: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates + citric.acid + density, data = wineData)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)          -0.125         1.095***      0.611**       0.646**     -14.504     
##                        (0.175)       (0.184)       (0.196)       (0.201)      (11.964)    
##   alcohol               0.361***      0.314***      0.309***      0.309***      0.323***  
##                        (0.017)       (0.016)       (0.016)       (0.016)       (0.019)    
##   volatile.acidity                   -1.384***     -1.221***     -1.265***     -1.301***  
##                                      (0.095)       (0.097)       (0.113)       (0.116)    
##   sulphates                                         0.679***      0.696***      0.680***  
##                                                    (0.101)       (0.103)       (0.104)    
##   citric.acid                                                    -0.079        -0.155     
##                                                                  (0.104)       (0.120)    
##   density                                                                      15.106     
##                                                                               (11.927)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.336         0.336         0.337     
##   adj. R-squared        0.226         0.316         0.335         0.334         0.335     
##   sigma                 0.710         0.668         0.659         0.659         0.659     
##   F                   468.267       370.379       268.912       201.777       161.803     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1599.384     -1599.093     -1598.288     
##   Deviance            805.870       711.796       692.105       691.852       691.157     
##   AIC                3448.114      3251.628      3208.768      3210.186      3210.576     
##   BIC                3464.245      3273.136      3235.654      3242.448      3248.216     
##   N                  1599          1599          1599          1599          1599         
## ==========================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Alcohol and volatile acidity seem to strengthen each other. Moreover, density also seems to strengthen alcohol.

Were there any interesting or surprising interactions between features?

The interaction between alcohol and density seems to contribute to the quality of the wine. However, it is not strong, as points of different quality values still overlap in the scatter plot we drew.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created 5 models to predict the quality of the wine. At first I used only alcohol, which gave an adjusted R-squared of 0.226. Adding volatile acidity raised the adjusted R-squared considerably, to 0.316, with all variables remaining significant. Adding sulphates raised it further, to 0.335, again with all variables significant. Adding citric acid and then density did not improve the adjusted R-squared (0.334 and 0.335), neither added variable is significant, and AIC is lowest for the three-variable model.
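
The stepwise comparison can be sketched as below. The data are simulated (the coefficients, sample size, and the unrelated predictor `noise_var` are all assumptions made for illustration), so only the workflow carries over to the wine data, not the numbers.

```r
set.seed(7)
n <- 300

# Simulated predictors and response; not the real wine data.
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
noise_var        <- rnorm(n)   # predictor unrelated to the response
quality <- 2 + 0.35 * alcohol - 1.3 * volatile.acidity + rnorm(n, sd = 0.65)

# Nested models, each adding one predictor to the previous one.
m1 <- lm(quality ~ alcohol)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + noise_var)

# Adjusted R-squared and AIC for each model.
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
aic <- AIC(m1, m2, m3)

adj_r2
aic
```

A predictor is worth keeping when it raises the adjusted R-squared and lowers the AIC; a predictor that does neither is a candidate for removal.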


Final Plots and Summary

Plot One

Description One

From the plot above, we can see that our wine quality data is imbalanced. There are many more ordinary wines (quality 5 to 6) than good ones (quality 7 to 8) or bad ones (quality 3 to 4). This is probably because ordinary wine is the most popular: it is less expensive than the good wines but still tasty.

Plot Two

Description Two

In this plot, the alcohol percentage tends to increase as the quality of the wine increases. However, there is an anomaly at quality 5, where the average alcohol percentage is lower than at quality 4. There are also some outliers at quality 5.

Plot Three

Description Three

The plot above shows that it may be possible to predict quality using alcohol and volatile acidity. When we compare the lines for low and high quality, for example quality 3 vs 8, we can see a clear separation, so it should be easy to separate low-quality wines from high-quality ones. However, the relationship may not be linear. To categorise the wines, it may be better to use logistic regression, SVM, or K-means.


Reflection

The red wine data set contains information on 1,599 red wines across 12 variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of red wine across many variables and created a linear model to predict the quality.

There was a clear trend between the alcohol or volatile acidity of a wine and its quality. I was surprised that pH and free sulfur dioxide did not have a strong negative correlation with quality, but these variables are likely to be represented by sulphates. I struggled to understand the outliers in the box plots, which usually occurred at quality 5 or 6, but this became clearer when I realized that most of the data has a quality of 5 or 6. For the linear model, all red wines were included, since quality, volatile acidity, alcohol, sulphates, citric acid, and density were available for every row. After fitting the linear model without transforming the variables, the model was able to account for 33.7% of the variance in quality.

One limitation of this model is that we use linear regression to fit a factor variable. In this case, I think it may be better to split quality into 2 levels, good or bad, and fit a logistic regression, SVM, or K-means model to the data, because it is easier to separate good wine from bad wine first before assigning a precise quality score.
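
As a rough sketch of the binary approach suggested above, the code below labels wines as good or not and fits a logistic regression with `glm()`. The data are simulated, and the cut-off of quality 7 or higher for a good wine is a hypothetical choice, not something the analysis has established.

```r
set.seed(3)
n <- 400

# Simulated alcohol values and quality scores that rise with alcohol.
alcohol <- rnorm(n, mean = 10.4, sd = 1)
quality <- pmin(8, pmax(3, round(5.6 + 0.5 * (alcohol - 10.4) + rnorm(n, sd = 0.7))))

# Hypothetical cut-off: quality of 7 or higher counts as good.
good <- quality >= 7

# Logistic regression of the good/bad label on alcohol.
fit <- glm(good ~ alcohol, family = binomial)
coef(fit)["alcohol"]

# Predicted probability of a good wine at 9% and 12% alcohol.
prob <- predict(fit, newdata = data.frame(alcohol = c(9, 12)), type = "response")
prob
```

Unlike the linear model, the predictions here are probabilities between 0 and 1, which matches a two-class target.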